First Data Science study project - EDA & visualization

What do we want to answer?

First, we import the libraries.

Let's take a look at the data.

Here we can check all the columns ("variables") that are available for the exploration, and start to see a possible path to follow.

- We can observe that the dataframe has 9800 rows and 18 columns.

We can observe that in "Postal Code" we have 11 null values.

- Since all 11 null values are from the city of Burlington, in the state of Vermont, I am going to replace them with a valid Postal Code chosen for that city (5401);

- also, both date columns were read as strings, so we want to convert them to datetime.
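The two cleaning steps above could be sketched roughly like this. The tiny frame is hypothetical and only stands in for the real data, and the day/month/year date format is an assumption that should be checked against the actual file.

```python
import pandas as pd

# Hypothetical miniature of the dataset (the real frame has 9800 rows).
df = pd.DataFrame({
    "City": ["Burlington", "Los Angeles", "Burlington"],
    "State": ["Vermont", "California", "Vermont"],
    "Postal Code": [None, 90036.0, None],
    "Order Date": ["08/11/2017", "12/06/2017", "15/01/2018"],
    "Ship Date": ["11/11/2017", "16/06/2017", "19/01/2018"],
})

# Count the null values per column before treating them.
null_counts = df.isnull().sum()
print(null_counts)

# Fill the missing postal codes with the chosen Burlington code.
df["Postal Code"] = df["Postal Code"].fillna(5401).astype(int)

# Assumption: the dates are stored as day/month/year strings.
for col in ["Order Date", "Ship Date"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

print(df.dtypes)
```

Passing an explicit `format` avoids pandas guessing between day-first and month-first dates.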

Null values treated.

Date columns converted to datetime.

- I created two new columns so I can analyze the data over different time "frames" (year or month);

- I also created a "Days to Ship" column, with the number of days from the Order Date until the start of shipping (Ship Date).
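A minimal sketch of how these columns can be derived from the datetime columns. The names "Order Year" and "Order Month" are hypothetical (only "Days to Ship" is named in the text), and the two rows are made-up sample data.

```python
import pandas as pd

# Hypothetical sample rows with already-converted datetime columns.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(["2017-11-08", "2017-06-12"]),
    "Ship Date": pd.to_datetime(["2017-11-11", "2017-06-16"]),
})

# Two time "frames" for later grouping: calendar year and year-month period.
df["Order Year"] = df["Order Date"].dt.year
df["Order Month"] = df["Order Date"].dt.to_period("M")

# Whole days between placing the order and shipping it.
df["Days to Ship"] = (df["Ship Date"] - df["Order Date"]).dt.days

print(df)
```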

Removing the columns I believe are not going to help:

- "Row ID", "Order ID", "Customer Name", "Country", "Postal Code" and "Product ID".
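Dropping these columns is a one-liner in pandas. The single row below is hypothetical sample data, used only so the snippet runs on its own.

```python
import pandas as pd

cols_to_drop = ["Row ID", "Order ID", "Customer Name", "Country", "Postal Code", "Product ID"]

# Hypothetical one-row frame containing the columns to drop plus two that are kept.
df = pd.DataFrame(
    [[1, "CA-1", "Ann", "United States", 90036, "P-1", "Technology", 100.0]],
    columns=cols_to_drop + ["Category", "Sales"],
)

# drop returns a new frame; reassign to keep the result.
df = df.drop(columns=cols_to_drop)
print(df.columns.tolist())
```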

Question 1

How were the sales in the analyzed period?

Here we can observe that we have data from 2015/January to 2018/December

- Total sales for the period were 2,261,536.78 dollars;
- the maximum order value in the whole period was 22,638.48 dollars.

*We can also start to look at the sales by year.
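Numbers like these come from simple aggregations over the date and sales columns. Here is a small sketch on made-up values (not the real data), assuming the sales column is named "Sales":

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(["2015-01-03", "2016-07-20", "2018-12-30"]),
    "Sales": [100.0, 250.5, 49.5],
})

# First and last order dates give the analyzed period.
period_start = df["Order Date"].min()
period_end = df["Order Date"].max()

# Total and maximum of the sales column.
total_sales = df["Sales"].sum()
max_sale = df["Sales"].max()

print(period_start.strftime("%Y/%B"), "to", period_end.strftime("%Y/%B"))
print(f"Total: {total_sales:,.2f} | Max: {max_sale:,.2f}")
```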

A more basic plot:

Here we can observe the number of orders by year:

- The number of orders increases every year, so 2018 was the year with the most orders.

And a more refined plot of sales by year.

*You can observe that this plot already gives us the percentage difference in sales between each year and the year before.

- Sales decreased by 4.26% in 2016, but increased by 30.64% and 20.30% in 2017 and 2018, respectively;

- 2018 was the best year in sales, followed by 2017.
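The year-over-year percentages shown in the plot can be computed with `pct_change`. A minimal sketch on invented numbers, assuming a hypothetical "Order Year" column:

```python
import pandas as pd

# Hypothetical sample; "Order Year" is an assumed column name.
df = pd.DataFrame({
    "Order Year": [2015, 2015, 2016, 2017, 2017],
    "Sales": [60.0, 40.0, 80.0, 70.0, 50.0],
})

# Total sales per year, in chronological order.
sales_by_year = df.groupby("Order Year")["Sales"].sum()

# Percentage change relative to the previous year (first year has no reference).
yoy_pct = (sales_by_year.pct_change() * 100).round(2)

print(pd.DataFrame({"sales": sales_by_year, "yoy_pct": yoy_pct}))
```

In this toy data, sales go from 100 to 80 to 120, so the changes are -20% and +50%.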

Now a look into the sales by month

- The best month in sales was 2018/November;

- every year the best month was November or December, except in 2015 (September);
    * We can start to observe that there may be a growth trend in sales after 2016, and a similar seasonality in 
    sales each year.

- The number of months in which sales were higher than the median monthly sales of the whole period decreased from 4 to 3 
between 2015 and 2016. After that, this number increased to 7 and 10 in 2017 and 2018, respectively. So, in 2018 sales were 
higher than this median for almost the entire year, and in November they were more than 3 times this median value.
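The month-level comparison against the median can be sketched as follows. This assumes the median is taken over the monthly sales totals of the whole period; the data is invented.

```python
import pandas as pd

# Hypothetical sample orders spread over four months.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2015-01-10", "2015-02-10", "2016-01-10", "2016-02-10", "2016-02-20"]
    ),
    "Sales": [10.0, 15.0, 30.0, 25.0, 25.0],
})

# Total sales per calendar month.
monthly = df.groupby(df["Order Date"].dt.to_period("M"))["Sales"].sum()

# Best month and whole-period median of the monthly totals.
best_month = monthly.idxmax()
median_monthly = monthly.median()

# For each year, count the months whose sales exceed that median.
above = monthly > median_monthly
months_above = above.groupby(monthly.index.year).sum()

print(best_month, median_monthly)
print(months_above)
```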

Ship Mode:

- Apparently the pattern observed here is the expected one, with most people ordering via Standard Class;

- the sum of sales by ship mode follows the same expected pattern - nothing stands out.

Days to Ship:

- The maximum Days to Ship is seven - not bad. But could it be better for the First, Second and Standard classes?
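To look at that question per class, one option is to summarize "Days to Ship" by ship mode. A small sketch on invented rows:

```python
import pandas as pd

# Hypothetical sample; the real data has many rows per ship mode.
df = pd.DataFrame({
    "Ship Mode": ["Same Day", "First Class", "First Class",
                  "Second Class", "Standard Class", "Standard Class"],
    "Days to Ship": [0, 2, 3, 4, 5, 7],
})

# Minimum, mean and maximum shipping delay for each ship mode.
ship_stats = df.groupby("Ship Mode")["Days to Ship"].agg(["min", "mean", "max"])
print(ship_stats)
```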

Segment:

- Consumer is the segment that orders the most;

- Consumer is also the main segment in sales.

Customer ID:

- We have 793 distinct customers;
- only 6 customers are "single-time buyers";
- Top 10 Customers by Sales - SM-20320 is the customer with the biggest revenue, by a good margin over the second.
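A sketch of how these customer figures can be obtained. Since "Order ID" was dropped earlier, this version assumes a "single-time buyer" is a customer that appears in exactly one row, which may differ from counting distinct orders. The data is invented.

```python
import pandas as pd

# Hypothetical sample with three customers.
df = pd.DataFrame({
    "Customer ID": ["A-1", "A-1", "B-2", "C-3", "C-3", "C-3"],
    "Sales": [10.0, 20.0, 500.0, 5.0, 5.0, 5.0],
})

# Number of distinct customers.
n_customers = df["Customer ID"].nunique()

# Assumption: one row per purchase, so a count of 1 means a single-time buyer.
rows_per_customer = df["Customer ID"].value_counts()
single_time = (rows_per_customer == 1).sum()

# Top customers by total revenue (up to 10).
top_customers = df.groupby("Customer ID")["Sales"].sum().nlargest(10)

print(n_customers, single_time)
print(top_customers)
```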

Location data - country Region:

- West and East are the top two regions, in that order, both in number of orders and in sales;

- Central and South are the two regions with the fewest orders. South represents only 16% of the orders;
- they are also the two regions with the lowest sales.
    *Maybe we could increase marketing there?
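The order share and sales per region can be computed like this. It is a sketch on invented rows in which each row counts as one order, so the shares are illustrative only.

```python
import pandas as pd

# Hypothetical sample: 10 rows spread over the four regions.
df = pd.DataFrame({
    "Region": ["West"] * 4 + ["East"] * 3 + ["Central"] * 2 + ["South"],
    "Sales": [50.0] * 4 + [40.0] * 3 + [30.0] * 2 + [20.0],
})

# Number of rows (orders) per region, sorted from most to fewest.
region_orders = df["Region"].value_counts()

# Same counts expressed as a percentage of all orders.
region_share = df["Region"].value_counts(normalize=True).mul(100).round(1)

# Total sales per region, highest first.
region_sales = df.groupby("Region")["Sales"].sum().sort_values(ascending=False)

print(region_share)
print(region_sales)
```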

States:

- The states that ordered the most are California (West) and NY (East);

- And the sales by state follow a very similar order, with California being the state with the biggest revenue, 
followed by NY, Texas (a Central state), Washington and Pennsylvania.

I first saw this beautiful "interactive" USA map done with Plotly in Samruddhi Mhatre's EDA, available on Kaggle, and then I implemented it here:

- https://www.kaggle.com/code/samruddhim/part-1-exploratory-data-analysis

Cities:

- New York City is the city with the biggest revenue (more than 10% of the business revenue) and is also the city that 
orders the most, in both cases followed by Los Angeles;

- NYC and LA each placed more orders than the fourth state (Pennsylvania);
- and NYC and LA each sold more than the whole state of Texas, the third state in total sales.
    *So, they are the most important cities for the business.

Question 2

Which was the best-selling category?

* We have Category and Sub-Category data

Category:

- Technology is the category with the biggest revenue, even though Office Supplies is the most ordered category;
*That's probably because of the sale price of tech products (much higher than that of office supplies).

- But there's a certain balance in revenue by Category. No category has sales that are very far from 
the others.
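The contrast between "most ordered" and "biggest revenue" can be checked by putting both measures side by side, together with the average sale value per row. A sketch on invented data:

```python
import pandas as pd

# Hypothetical sample: few expensive tech sales, many cheap office-supply sales.
df = pd.DataFrame({
    "Category": ["Technology", "Technology", "Office Supplies", "Office Supplies",
                 "Office Supplies", "Office Supplies", "Furniture"],
    "Sales": [900.0, 1100.0, 20.0, 30.0, 50.0, 40.0, 300.0],
})

# Row count (orders) and total revenue per category.
summary = df.groupby("Category").agg(
    orders=("Sales", "count"),
    revenue=("Sales", "sum"),
)

# Average value per row helps explain a high revenue with few orders.
summary["avg_sale"] = summary["revenue"] / summary["orders"]

print(summary.sort_values("revenue", ascending=False))
```

The same pattern works for "Sub-Category" by changing the grouping column.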

Sub-Category:

- The most ordered sub-categories are Binders and Paper (Office Supplies) - nothing stands out;
- but the sub-categories with the biggest revenues are Phones (Technology) and Chairs (Furniture).

Question 3

Which was the top-selling product?

- The most ordered products are Staple Envelope, Staples and Easy-staple paper (Office Supplies category);
    *All of them have more than double the orders of the fourth item (also Office Supplies).
- Even though it is not among the 20 most ordered items, Canon imageCLASS 2200 Advanced Copier is the first item in revenue, with more than double the second item's revenue;

- none of the 10 items with the biggest revenues are among the 20 most ordered items!
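The "no overlap" observation can be verified with a set intersection between the two rankings. This is a sketch on invented rows; the "Product Name" column name is an assumption, and the toy example uses the top 2 of each ranking instead of the top 20 and top 10.

```python
import pandas as pd

# Hypothetical sample; "Product Name" is an assumed column name.
df = pd.DataFrame({
    "Product Name": ["Staples", "Staples", "Staples", "Envelope",
                     "Envelope", "Copier", "Desk"],
    "Sales": [5.0, 6.0, 4.0, 8.0, 7.0, 3000.0, 400.0],
})

# Row count (orders) and total revenue per product.
by_product = df.groupby("Product Name")["Sales"].agg(orders="count", revenue="sum")

# Top products by each measure (sizes reduced for the toy data).
top_ordered = set(by_product["orders"].nlargest(2).index)
top_revenue = set(by_product["revenue"].nlargest(2).index)

# Products that appear in both rankings.
overlap = top_ordered & top_revenue

print(top_ordered, top_revenue, overlap)
```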

Here we finish answering the questions that were asked of us.

We could continue the exploration, but that's enough for what was asked.

Recapping the process and the main findings

How were the sales in the analyzed period?

Which was the best-selling category?

Which was the top-selling product?

Next steps

Since we have already answered these questions, our work here is finished. But we could go further and build a time series model to forecast future sales. That model would use the behavior of sales in the observed data, such as trend and seasonality, to predict how this variable behaves in an upcoming period. But that will be for another time.